Sockets and Networking

The Incident: Building HTTP from a Raw Socket

The best way to understand what the requests library does for you is to write an HTTP client without it. Here is a complete HTTP GET request to httpbin.org using only Python's socket module:

import socket

HOST = "httpbin.org"
PORT = 80
PATH = "/get"

# 1. Create a TCP socket
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(10)

# 2. Resolve the hostname to an IP address
addr = socket.getaddrinfo(HOST, PORT, socket.AF_INET, socket.SOCK_STREAM)
ip = addr[0][4][0]
print(f"Resolved {HOST} -> {ip}")

# 3. Establish a TCP connection
sock.connect((ip, PORT))

# 4. Send an HTTP/1.1 GET request
request = (
    f"GET {PATH} HTTP/1.1\r\n"
    f"Host: {HOST}\r\n"
    f"Connection: close\r\n"
    f"User-Agent: raw-socket-client/1.0\r\n"
    f"\r\n"
)
sock.sendall(request.encode("utf-8"))

# 5. Receive the response - may require multiple recv() calls
response_parts = []
while True:
    chunk = sock.recv(4096)
    if not chunk:
        break
    response_parts.append(chunk)

response = b"".join(response_parts).decode("utf-8", errors="replace")

# 6. Close the connection
sock.close()

# Parse the response
header_section, _, body = response.partition("\r\n\r\n")
status_line = header_section.split("\r\n")[0]
print(f"Status: {status_line}")
print(f"Body (first 200 chars): {body[:200]}")

The requests library does the same work underneath (via urllib3 and the standard library) and adds connection pooling, TLS, redirect following, cookie management, timeout handling, and a clean API. Each step of the raw version above has a counterpart inside that stack. Knowing the raw version means you can debug TLS handshake failures, connection reset errors, and keepalive issues at their source.
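The parsing step at the end of the raw client is the part that libraries hide most thoroughly. As a rough sketch of what it involves, the helper below (parse_http_response is a name made up for this example) splits a raw response into status code, headers, and body. It assumes the whole response is already in memory and ignores chunked transfer encoding and repeated headers, both of which a real client must handle.

```python
def parse_http_response(raw: bytes) -> tuple[int, dict[str, str], bytes]:
    """Split a raw HTTP/1.x response into (status code, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # The status line looks like: HTTP/1.1 200 OK
    _version, status, *_reason = lines[0].split(" ", 2)

    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so normalise them to lower case
        headers[name.strip().lower()] = value.strip()
    return int(status), headers, body


# A canned response stands in for bytes read from the socket
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: application/json\r\n"
    b"Content-Length: 2\r\n"
    b"\r\n"
    b"{}"
)
status, headers, body = parse_http_response(sample)
print(status, headers["content-type"], body)
```

Keeping the body as bytes until the Content-Type is known avoids decoding binary payloads as text, which the raw client above does for simplicity.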

The BSD Socket Lifecycle

A server socket follows a specific sequence of operations. Every framework - asyncio, tornado, gunicorn - executes this same sequence:

Server                                     Client
───────                                    ──────

socket()                                   socket()
  │  Create a socket endpoint              │
  ▼                                        │
bind()                                     │
  │  Attach to an address:port             │
  ▼                                        │
listen()                                   │
  │  Mark as passive socket,               │
  │  kernel maintains connection backlog   │
  ▼                                        ▼
accept()   ◄────── SYN ────────────────  connect()
  │         ────── SYN-ACK ──────────►     │
  │         ◄───── ACK ────────────────    │
  │  Returns new socket for this client   │
  ▼                                        ▼
recv()/send()  ◄─── data ──────────►  send()/recv()
  │                                        │
  ▼                                        ▼
close()    ◄────── FIN ────────────────  close()
           ─────── ACK ──────────────►
           ─────── FIN ──────────────►
           ◄────── ACK ────────────────

The socket() call creates a socket file descriptor. bind() attaches it to a local address. listen() puts it in passive mode - the kernel will complete TCP handshakes on behalf of the process and queue connections. accept() dequeues one completed connection and returns a new socket FD for communication with that client. The listening socket remains open to accept more clients.
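The sequence can be traced end to end on the loopback interface inside a single process. This is a minimal sketch, not a server design: it binds to port 0 so the OS picks a free port, and it relies on the kernel completing the handshake and queuing the connection before accept() is ever called.

```python
import socket

# Server side: socket() -> bind() -> listen(). Port 0 lets the OS pick a free port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(5)
port = listener.getsockname()[1]

# Client side: connect() returns once the kernel has completed the handshake,
# even though the server process has not called accept() yet.
client = socket.create_connection(("127.0.0.1", port), timeout=5)
client.sendall(b"ping")

# accept() dequeues the already-established connection as a NEW socket.
conn, peer = listener.accept()
conn.settimeout(5)
received = conn.recv(1024)

print(f"listener fd={listener.fileno()}, per-client fd={conn.fileno()}")
print(f"received {received!r} from {peer}")

conn.close()
client.close()

# Closing the per-client socket does not affect the listening socket.
still_listening = listener.fileno() != -1
listener.close()
```

The two file descriptors printed are different: one belongs to the listening socket, the other to the single accepted connection.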

TCP Server from Scratch

import socket
import threading

HOST = "0.0.0.0"
PORT = 9000

def handle_client(conn: socket.socket, addr: tuple) -> None:
    """Handle a single client connection in its own thread."""
    print(f"Connection from {addr[0]}:{addr[1]}")
    try:
        while True:
            data = conn.recv(1024)
            if not data:
                # Empty recv means the client closed the connection (FIN received)
                break
            print(f"  Received {len(data)} bytes from {addr[0]}:{addr[1]}")
            # Echo the data back
            conn.sendall(data)
    except ConnectionResetError:
        print(f"  {addr} reset the connection")
    except socket.timeout:
        print(f"  {addr} timed out")
    finally:
        conn.close()
        print(f"Connection from {addr} closed")


def run_server():
    # AF_INET = IPv4, SOCK_STREAM = TCP (reliable, ordered, connection-oriented)
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    # SO_REUSEADDR: allows reuse of the address immediately after server restart
    # Without this, bind() fails with "Address already in use" for ~60 seconds
    # after the previous server exits (TIME_WAIT state)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

    server.bind((HOST, PORT))

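    # backlog=128: the kernel queues up to 128 completed connections
    # waiting for accept() (on Linux the value is capped by net.core.somaxconn).
    # When the queue is full, Linux silently drops new handshakes by default,
    # so clients retry; it sends RST only if net.ipv4.tcp_abort_on_overflow=1.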
    server.listen(128)
    print(f"Echo server listening on {HOST}:{PORT}")

    try:
        while True:
            # accept() blocks until a client connects
            # Returns (new_socket, (client_ip, client_port))
            conn, addr = server.accept()

            # Set a 30-second read timeout on the client socket
            conn.settimeout(30.0)

            # Spawn a thread for each client - simple but doesn't scale beyond ~1000
            t = threading.Thread(target=handle_client, args=(conn, addr), daemon=True)
            t.start()
    except KeyboardInterrupt:
        print("\nShutting down server")
    finally:
        server.close()


if __name__ == "__main__":
    run_server()

TCP Client from Scratch

import socket

def tcp_client(host: str, port: int, message: str) -> str:
    """Send a message over TCP and return the echoed response."""
    # Create socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    # settimeout() applies to every blocking call on this socket:
    # connect(), sendall(), and recv() each wait at most 5 seconds
    sock.settimeout(5.0)

    try:
        # Connect - triggers TCP three-way handshake
        sock.connect((host, port))

        # Send all bytes (sendall loops internally until all bytes are sent)
        sock.sendall(message.encode("utf-8"))

        # Signal end of our send stream - sends TCP FIN in one direction
        # Server sees EOF when reading, but can still write back to us
        sock.shutdown(socket.SHUT_WR)

        # Receive the response - loop because TCP is a stream
        # recv() may return less than the full response
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break   # server closed the connection
            chunks.append(chunk)

        return b"".join(chunks).decode("utf-8")

    except socket.timeout:
        raise TimeoutError(f"Connection to {host}:{port} timed out")
    except ConnectionRefusedError:
        raise ConnectionError(f"Connection refused by {host}:{port}")
    finally:
        sock.close()

# Test against our echo server
response = tcp_client("127.0.0.1", 9000, "Hello, systems programming!")
print(f"Echo: {response}")
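The half-close performed by shutdown(socket.SHUT_WR) is easier to see in isolation. The sketch below uses socket.socketpair(), which returns two already-connected sockets, purely to avoid needing a running server; the variable names are illustrative.

```python
import socket

# Two connected endpoints in one process: "a" plays the client, "b" the server.
a, b = socket.socketpair()
a.settimeout(5)
b.settimeout(5)

a.sendall(b"request")
a.shutdown(socket.SHUT_WR)   # a is done sending; b sees EOF after the data

got = b.recv(1024)           # the bytes sent before the shutdown
eof = b.recv(1024)           # empty bytes: a closed its write side

b.sendall(b"reply")          # b's own write direction is still open
reply = a.recv(1024)         # a can still read

a.close()
b.close()
print(got, eof, reply)
```

This is why the client above can signal "end of request" without losing the ability to read the echoed response.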

Socket Options

Socket options control low-level TCP/IP behavior. They are set with setsockopt():

import socket
import struct

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# SO_REUSEADDR (level: SOL_SOCKET)
# Allows bind() to reuse a port in TIME_WAIT state.
# Essential for any server - without it, restart fails for ~60 seconds.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

# SO_REUSEPORT (level: SOL_SOCKET, Linux 3.9+)
# Allows multiple sockets to bind to the same port.
# The kernel load-balances incoming connections across them.
# Used by nginx worker processes and modern asyncio servers.
if hasattr(socket, "SO_REUSEPORT"):
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)

# SO_KEEPALIVE
# Enables TCP keepalive probes - the kernel sends probes if the connection
# is idle. Detects dead peers (crashed machines, disconnected cables)
# without application-level heartbeats.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# On Linux, configure keepalive timing:
if hasattr(socket, "TCP_KEEPIDLE"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE,  60)   # idle before first probe (s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)   # interval between probes (s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT,    5)   # max probes before giving up

# TCP_NODELAY (level: IPPROTO_TCP)
# Disables Nagle's algorithm. Nagle holds back a small send while earlier data
# is still unacknowledged; combined with the peer's delayed-ACK timer this can
# add tens to hundreds of milliseconds. For latency-sensitive
# protocols (Redis, HTTP/2, gRPC), disable it.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# SO_LINGER
# Controls behavior on close() when unsent data remains.
# l_onoff=1, l_linger=0: close() sends RST immediately, discards unsent data (abortive close)
# l_onoff=1, l_linger=5: close() blocks up to 5 seconds while the kernel tries
# to deliver the remaining data and complete the shutdown
linger_struct = struct.pack("ii", 1, 5)   # (l_onoff=1, l_linger=5)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, linger_struct)

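# SO_RCVBUF / SO_SNDBUF
# Override the kernel's socket receive/send buffer sizes.
# The generic default (net.core.rmem_default) is ~208 KiB on Linux; TCP sockets
# start from net.ipv4.tcp_rmem and are autotuned. Setting a size explicitly
# (for example ~4 MB for high-throughput links) turns autotuning off for this socket.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4 * 1024 * 1024)

# Check current buffer size (Linux caps the request at net.core.rmem_max
# and reports double the granted value to account for bookkeeping overhead)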
actual_rcvbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(f"Actual receive buffer: {actual_rcvbuf} bytes")
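Because the kernel may adjust or reject what you ask for, it is worth reading options back with getsockopt(). A small sketch follows, assuming a Unix-like platform where struct linger is two C ints (the same "ii" layout used above).

```python
import socket
import struct

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 5))

# Integer options come back as ints; any non-zero value means "enabled".
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
keepalive = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0

# Struct-valued options need an explicit buffer length and come back as bytes.
raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.calcsize("ii"))
l_onoff, l_linger = struct.unpack("ii", raw)

print(f"TCP_NODELAY={nodelay} SO_KEEPALIVE={keepalive}")
print(f"SO_LINGER: l_onoff={l_onoff} l_linger={l_linger}")

sock.close()
```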

Blocking vs Non-Blocking Sockets

By default, socket operations block: accept() waits until a client connects, recv() waits until data arrives. Non-blocking sockets return immediately with BlockingIOError (errno EAGAIN/EWOULDBLOCK) if the operation would block.

import socket
import select
import errno

# Create a non-blocking server socket
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", 9001))
server.listen(128)
server.setblocking(False)   # or: server.settimeout(0)

clients = []

# select()-based I/O multiplexing - the original POSIX interface
# Works on any platform but limited to 1024 FDs (FD_SETSIZE)
while True:
    read_fds = [server] + clients
    readable, _, _ = select.select(read_fds, [], [], 1.0)  # 1s timeout

    for fd in readable:
        if fd is server:
            try:
                conn, addr = server.accept()
                conn.setblocking(False)
                clients.append(conn)
                print(f"New client: {addr}")
            except BlockingIOError:
                pass  # should not happen since select said readable
        else:
            try:
                data = fd.recv(1024)
                if data:
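                    # Echo. On a non-blocking socket sendall() can raise
                    # BlockingIOError after a partial send; production code
                    # buffers the remainder and waits for writability.
                    fd.sendall(data)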
                else:
                    # Client disconnected
                    fd.close()
                    clients.remove(fd)
            except BlockingIOError:
                pass  # no data right now - try again next iteration
            except ConnectionResetError:
                fd.close()
                clients.remove(fd)

`select.poll()`: No FD Limit, Unix Only

import select
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", 9002))
server.listen(128)
server.setblocking(False)

poller = select.poll()
# POLLIN: ready to read; POLLHUP: remote closed; POLLERR: error
poller.register(server.fileno(), select.POLLIN)

fd_to_socket = {server.fileno(): server}

while True:
    events = poller.poll(1000)   # 1000ms timeout
    for fd, event in events:
        sock = fd_to_socket[fd]
        if sock is server:
            conn, addr = server.accept()
            conn.setblocking(False)
            fd_to_socket[conn.fileno()] = conn
            poller.register(conn.fileno(), select.POLLIN | select.POLLHUP)
            print(f"Connected: {addr}")
        elif event & (select.POLLHUP | select.POLLERR):
            poller.unregister(fd)
            sock.close()
            del fd_to_socket[fd]
        elif event & select.POLLIN:
            data = sock.recv(4096)
            if data:
                sock.sendall(data)
            else:
                poller.unregister(fd)
                sock.close()
                del fd_to_socket[fd]

`epoll`-Based Event Loop: What asyncio Builds On

epoll is the Linux-specific I/O notification mechanism whose wait cost scales with the number of ready FDs rather than the number being watched. Unlike select and poll, which scan every registered FD on each call (O(n)), epoll maintains an interest list in the kernel and returns only the FDs that are actually ready. On Linux, this is what asyncio, tornado, and nginx are built on.

import select
import socket
import os

def run_epoll_echo_server(host: str = "0.0.0.0", port: int = 9003):
    """
    A minimal echo server using epoll.
    This is essentially what asyncio's SelectorEventLoop does
    before wrapping everything in coroutines.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(128)
    server.setblocking(False)

    # Create an epoll file descriptor
    # This is an fd in itself - kernel maintains the interest list
    ep = select.epoll()

    # Register the server socket for read events
    # EPOLLIN: data available to read / connection ready to accept
    # EPOLLET would select edge-triggered mode (notify only on state changes);
    # it is not set here, so this server is level-triggered
    ep.register(server.fileno(), select.EPOLLIN)

    connections = {}    # fd -> socket
    addresses = {}      # fd -> (ip, port)

    print(f"epoll echo server on {host}:{port} (PID {os.getpid()})")

    try:
        while True:
            # epoll_wait: blocks until at least one event is ready
            # Returns list of (fd, event_mask) pairs
            # timeout=-1 means wait indefinitely; timeout=1.0 means 1 second
            events = ep.poll(timeout=1.0)

            for fd, event in events:
                if fd == server.fileno():
                    # New connection
                    conn, addr = server.accept()
                    conn.setblocking(False)
                    ep.register(conn.fileno(), select.EPOLLIN | select.EPOLLRDHUP)
                    connections[conn.fileno()] = conn
                    addresses[conn.fileno()] = addr
                    print(f"  Accept: {addr}")

                elif event & (select.EPOLLERR | select.EPOLLHUP):
                    # Error or full hangup
                    ep.unregister(fd)
                    connections[fd].close()
                    print(f"  Closed: {addresses.get(fd)}")
                    del connections[fd]
                    del addresses[fd]

                elif event & (select.EPOLLIN | select.EPOLLRDHUP):
                    # Data ready to read, or the peer half-closed. Read first so
                    # bytes that arrived together with the FIN are still echoed;
                    # recv() returns b"" once the stream is drained.
                    conn = connections[fd]
                    try:
                        data = conn.recv(4096)
                        if data:
                            conn.sendall(data)
                        else:
                            ep.unregister(fd)
                            conn.close()
                            print(f"  EOF: {addresses.get(fd)}")
                            del connections[fd]
                            del addresses[fd]
                    except (ConnectionResetError, BrokenPipeError):
                        ep.unregister(fd)
                        connections[fd].close()
                        del connections[fd]
                        del addresses[fd]

    except KeyboardInterrupt:
        print("\nShutting down")
    finally:
        ep.close()
        server.close()
        for conn in connections.values():
            conn.close()


if __name__ == "__main__":
    run_epoll_echo_server()

The kernel data flow for epoll:

User space                     Kernel space
──────────                     ────────────

ep = epoll()          ──────►  Allocate epoll instance
                               (an eventpoll struct with two data structures:
                                - rbr: red-black tree of interest list
                                - rdllist: ready list)

ep.register(fd, EPOLLIN) ───►  Add fd to red-black tree

                               [Network driver receives packet]
                               [TCP stack processes it]
                               [Links fd's entry onto rdllist; it stays in rbr]

events = ep.poll()    ──────►  epoll_wait syscall
                               If rdllist non-empty: return events immediately
                               If empty: sleep until event or timeout

                               Returns [(fd1, event1), (fd2, event2), ...]
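The server above is level-triggered, so a socket with unread data is reported again on every poll. With EPOLLET (edge-triggered) a socket is reported only when new data arrives, which means the handler has to keep reading until the kernel buffer is empty. The helper below is a sketch of that drain loop (drain is a name chosen for this example). It is demonstrated on a socketpair so it runs on any Unix-like system, without epoll itself.

```python
import socket


def drain(sock: socket.socket, bufsize: int = 4096) -> tuple[bytes, bool]:
    """Read everything currently queued on a non-blocking socket.

    Returns (data, eof), where eof is True if the peer closed its side.
    """
    chunks = []
    while True:
        try:
            chunk = sock.recv(bufsize)
        except BlockingIOError:
            return b"".join(chunks), False   # kernel buffer is empty for now
        if not chunk:
            return b"".join(chunks), True    # orderly shutdown by the peer
        chunks.append(chunk)


a, b = socket.socketpair()
b.setblocking(False)

a.sendall(b"x" * 5000)
data, eof = drain(b, bufsize=1024)   # takes several recv() calls to empty the buffer

a.close()
tail, closed = drain(b)              # nothing left; the peer has gone away

print(len(data), eof, len(tail), closed)
b.close()
```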

UDP Sockets

UDP is connectionless - no handshake, no guaranteed delivery, no ordering. Each sendto() produces one datagram; each recvfrom() receives one datagram. It has lower per-message overhead than TCP and suits use cases where occasional packet loss is acceptable: DNS, metrics, game state updates, logging.

import socket

# UDP server
def udp_echo_server(host: str = "0.0.0.0", port: int = 9004):
    # SOCK_DGRAM = UDP
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    print(f"UDP echo server on {host}:{port}")

    while True:
        # recvfrom returns (data, (sender_ip, sender_port))
        data, addr = sock.recvfrom(65535)
        print(f"UDP from {addr}: {data.decode()!r}")
        # sendto: specify destination for each datagram
        sock.sendto(data, addr)


# UDP client
def udp_client(host: str, port: int, message: str) -> str:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(2.0)
    try:
        sock.sendto(message.encode(), (host, port))
        response, _ = sock.recvfrom(65535)
        return response.decode()
    finally:
        sock.close()

# UDP broadcast (LAN discovery)
def send_broadcast(port: int, message: str):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    sock.sendto(message.encode(), ("<broadcast>", port))
    sock.close()
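The one-datagram-per-call behaviour is the main practical difference from TCP's byte stream. The following sketch sends two datagrams of different sizes over loopback and reads them back. It assumes nothing is dropped, which is normally true on loopback but is not something UDP guarantees.

```python
import socket

receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))      # port 0: let the OS choose a free port
receiver.settimeout(2.0)
addr = receiver.getsockname()

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"first", addr)
sender.sendto(b"second datagram", addr)

# Each recvfrom() returns exactly one datagram, never a merge of the two,
# even though the buffer is large enough to hold both.
msg1, _ = receiver.recvfrom(65535)
msg2, _ = receiver.recvfrom(65535)

print(msg1, msg2)
sender.close()
receiver.close()
```

With a TCP socket the same two sends could arrive as a single 20-byte read, which is why TCP protocols need explicit framing.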

Unix Domain Sockets: Faster Local IPC

Unix domain sockets (AF_UNIX) connect processes on the same machine. The address is either a filesystem path or, on Linux, a name in the abstract namespace. They are significantly faster than TCP loopback because there is no IP stack processing - no checksums, no TTL, no routing. Docker, containerd, PostgreSQL, and Redis all support Unix domain sockets for local connections.

import socket
import os

SOCKET_PATH = "/tmp/myapp.sock"

# Server - SOCK_STREAM (connection-oriented, like TCP)
def unix_socket_server():
    # Remove stale socket file if it exists
    if os.path.exists(SOCKET_PATH):
        os.unlink(SOCKET_PATH)

    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(SOCKET_PATH)
    # Set permissions before listen(): only the owner can connect, and no
    # other user can slip in between bind() and chmod()
    os.chmod(SOCKET_PATH, 0o600)
    server.listen(10)

    print(f"Unix socket server on {SOCKET_PATH}")
    conn, _ = server.accept()
    try:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)
    finally:
        conn.close()
        server.close()
        os.unlink(SOCKET_PATH)


# Client
def unix_socket_client(message: str) -> str:
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(SOCKET_PATH)
    try:
        sock.sendall(message.encode())
        sock.shutdown(socket.SHUT_WR)
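        # Loop until EOF - a single recv() may return only part of the reply
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks).decode()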
    finally:
        sock.close()


# Abstract namespace (Linux only) - no filesystem entry, auto-cleaned on close
def abstract_socket_example():
    # Prefix with \0 for abstract namespace
    abstract_addr = "\0myapp_socket"

    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(abstract_addr)
    server.listen(5)
    # No socket file to clean up - disappears when all FDs referencing it are closed
    return server

Performance comparison - Unix socket vs TCP loopback for 1 million small messages (measured on Linux 5.15):

Unix domain socket:   ~0.8 μs per message round-trip
TCP loopback:         ~3.2 μs per message round-trip

Throughput (1 KB messages):
Unix domain socket:   ~2.1 GB/s
TCP loopback:         ~0.9 GB/s

The difference is the TCP/IP stack: segment and header construction, the TCP state machine, ACK handling, and routing lookups all still run on the loopback path. Unix sockets skip all of it.
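A rough way to compare the two transports on your own machine is sketched below. The helper names (tcp_pair, round_trip_us) are invented for this example, both endpoints live in one process, and it assumes a Unix-like system with AF_UNIX. Python's interpreter overhead dominates the timings, so expect absolute numbers well above the figures quoted above; only the relative difference is meaningful.

```python
import socket
import time


def tcp_pair() -> tuple[socket.socket, socket.socket]:
    """Return two connected TCP sockets over the loopback interface."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    listener.close()
    for s in (client, server):
        # Disable Nagle so small messages are sent immediately
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return client, server


def round_trip_us(a: socket.socket, b: socket.socket,
                  rounds: int = 5000, payload: bytes = b"x" * 64) -> float:
    """Average time in microseconds for one a -> b -> a message exchange."""
    size = len(payload)
    start = time.perf_counter()
    for _ in range(rounds):
        a.sendall(payload)
        b.recv(size, socket.MSG_WAITALL)
        b.sendall(payload)
        a.recv(size, socket.MSG_WAITALL)
    return (time.perf_counter() - start) / rounds * 1e6


unix_a, unix_b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
tcp_a, tcp_b = tcp_pair()

unix_us = round_trip_us(unix_a, unix_b)
tcp_us = round_trip_us(tcp_a, tcp_b)
print(f"Unix domain socket: {unix_us:.1f} us per round trip")
print(f"TCP loopback:       {tcp_us:.1f} us per round trip")

for s in (unix_a, unix_b, tcp_a, tcp_b):
    s.close()
```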

A Minimal HTTP/1.1 Server (Raw Sockets)

import socket
import os
import time
import threading
from pathlib import Path

WEBROOT = "/tmp/www"
os.makedirs(WEBROOT, exist_ok=True)
# Create a test file
Path(f"{WEBROOT}/index.html").write_text("<h1>Hello from raw socket HTTP server</h1>")

STATUS_MESSAGES = {
    200: "OK",
    400: "Bad Request",
    403: "Forbidden",
    404: "Not Found",
    405: "Method Not Allowed",
    500: "Internal Server Error",
}

def send_response(conn: socket.socket, status: int, body: bytes,
                  content_type: str = "text/html") -> None:
    msg = STATUS_MESSAGES.get(status, "Unknown")
    headers = (
        f"HTTP/1.1 {status} {msg}\r\n"
        f"Content-Type: {content_type}; charset=utf-8\r\n"
        f"Content-Length: {len(body)}\r\n"
        f"Connection: close\r\n"
        f"Server: raw-python/1.0\r\n"
        f"Date: {time.strftime('%a, %d %b %Y %H:%M:%S GMT', time.gmtime())}\r\n"
        f"\r\n"
    )
    conn.sendall(headers.encode("utf-8") + body)


def handle_http(conn: socket.socket, addr: tuple) -> None:
    try:
        raw = b""
        while b"\r\n\r\n" not in raw:
            chunk = conn.recv(4096)
            if not chunk:
                return
            raw += chunk
            if len(raw) > 8192:
                send_response(conn, 400, b"Request too large")
                return

        request_line = raw.split(b"\r\n")[0].decode("utf-8", errors="replace")
        parts = request_line.split()
        if len(parts) < 3:
            send_response(conn, 400, b"Bad request")
            return

        method, path, _ = parts[0], parts[1], parts[2]

        if method != "GET":
            send_response(conn, 405, b"Only GET is supported")
            return

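        # Sanitize path - prevent directory traversal.
        # Drop any query string, resolve the path (including symlinks), and
        # require that the result still lies inside WEBROOT.
        root = os.path.realpath(WEBROOT)
        rel_path = path.split("?", 1)[0].lstrip("/")
        file_path = os.path.realpath(os.path.join(root, rel_path))
        if os.path.commonpath([root, file_path]) != root:
            send_response(conn, 403, b"Forbidden")
            return
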
        if os.path.isdir(file_path):
            file_path = os.path.join(file_path, "index.html")

        if not os.path.isfile(file_path):
            send_response(conn, 404, f"Not found: {path}".encode())
            return

        with open(file_path, "rb") as f:
            body = f.read()

        content_type = "text/html" if file_path.endswith(".html") else "application/octet-stream"
        send_response(conn, 200, body, content_type)
        print(f"  200 GET {path} ({len(body)} bytes) from {addr[0]}")

    except Exception as e:
        try:
            send_response(conn, 500, str(e).encode())
        except Exception:
            pass
    finally:
        conn.close()


def run_http_server(host: str = "0.0.0.0", port: int = 8080):
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(128)
    print(f"HTTP server on http://{host}:{port}")
    print(f"Serving files from {WEBROOT}")

    try:
        while True:
            conn, addr = server.accept()
            conn.settimeout(10.0)
            t = threading.Thread(target=handle_http, args=(conn, addr), daemon=True)
            t.start()
    except KeyboardInterrupt:
        print("\nServer stopped")
    finally:
        server.close()


if __name__ == "__main__":
    run_http_server()

TLS with the `ssl` Module

import ssl
import socket

# Wrap a client socket with TLS
def https_get(host: str, path: str = "/") -> str:
    # Create the SSL context with sensible defaults
    ctx = ssl.create_default_context()

    # Connect to port 443 and wrap the socket
    with socket.create_connection((host, 443), timeout=10) as raw_sock:
        with ctx.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
            # At this point TLS handshake is complete
            print(f"TLS version: {tls_sock.version()}")
            print(f"Cipher: {tls_sock.cipher()}")
            cert = tls_sock.getpeercert()
            print(f"Server cert subject: {dict(x[0] for x in cert['subject'])}")

            request = (
                f"GET {path} HTTP/1.1\r\n"
                f"Host: {host}\r\n"
                f"Connection: close\r\n"
                f"\r\n"
            )
            tls_sock.sendall(request.encode())

            chunks = []
            while True:
                chunk = tls_sock.recv(4096)
                if not chunk:
                    break
                chunks.append(chunk)

    return b"".join(chunks).decode("utf-8", errors="replace")

# result = https_get("httpbin.org", "/get")
# print(result[:500])


# Server-side TLS
def tls_server_example():
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(certfile="server.crt", keyfile="server.key")
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2

    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("0.0.0.0", 8443))
    server.listen(10)

    with ctx.wrap_socket(server, server_side=True) as tls_server:
        print("TLS server listening on :8443")
        conn, addr = tls_server.accept()
        print(f"TLS connection from {addr}")
        data = conn.recv(1024)
        conn.sendall(data)
        conn.close()
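It helps to know what ssl.create_default_context() actually configures, since those defaults are what make the client above safe. The sketch below only inspects context settings and opens no connection.

```python
import ssl

# Client-side default: verify the server's certificate chain and its hostname.
client_ctx = ssl.create_default_context()
print("client verify_mode:", client_ctx.verify_mode)
print("client check_hostname:", client_ctx.check_hostname)

# Server-side default: clients are not asked to present a certificate.
server_ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
print("server verify_mode:", server_ctx.verify_mode)

# Making the protocol floor explicit, as the server example above does.
client_ctx.minimum_version = ssl.TLSVersion.TLSv1_2
print("client minimum_version:", client_ctx.minimum_version)
```

Building a context by hand with ssl.SSLContext(), as the server example does, is fine for servers; for clients, starting from create_default_context() avoids accidentally leaving verification off.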

Binary Protocol Framing with `struct`

Raw TCP is a byte stream - there are no message boundaries. The kernel can merge or split your data into arbitrary chunks. You must implement message framing. The standard approach: a fixed-size length prefix.

import socket
import struct

# Message format:
# ┌──────────┬────────────────┐
# │ 4 bytes  │  N bytes       │
# │ length   │  payload       │
# │ (uint32) │  (arbitrary)   │
# └──────────┴────────────────┘

HEADER_FORMAT = "!I"    # network byte order, unsigned int (4 bytes)
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def send_message(sock: socket.socket, data: bytes) -> None:
    """Send a length-prefixed message."""
    header = struct.pack(HEADER_FORMAT, len(data))
    sock.sendall(header + data)


def recv_message(sock: socket.socket) -> bytes:
    """Receive a length-prefixed message, handling partial recv()."""
    # Read exactly HEADER_SIZE bytes
    header_data = recv_exactly(sock, HEADER_SIZE)
    (length,) = struct.unpack(HEADER_FORMAT, header_data)

    if length > 64 * 1024 * 1024:   # 64 MB sanity limit
        raise ValueError(f"Message too large: {length} bytes")

    # Read exactly `length` bytes of payload
    return recv_exactly(sock, length)


def recv_exactly(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from a socket, handling short reads."""
    buf = bytearray(n)
    view = memoryview(buf)
    received = 0
    while received < n:
        count = sock.recv_into(view[received:], n - received)
        if count == 0:
            raise ConnectionError("Connection closed before all bytes received")
        received += count
    return bytes(buf)


# Example: a simple request-reply protocol
import json

def send_json(sock: socket.socket, obj) -> None:
    data = json.dumps(obj).encode("utf-8")
    send_message(sock, data)

def recv_json(sock: socket.socket):
    data = recv_message(sock)
    return json.loads(data.decode("utf-8"))
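The recv_exactly() approach blocks until a whole message arrives, which does not fit the non-blocking event loops shown earlier. One alternative, sketched here under the same 4-byte length-prefix format, is an incremental decoder that accepts whatever bytes arrive and hands back any messages that are complete. FrameDecoder and encode are illustrative names, not part of any library.

```python
import struct

HEADER = struct.Struct("!I")          # 4-byte big-endian length prefix
MAX_MESSAGE = 64 * 1024 * 1024        # same 64 MB sanity limit as above


class FrameDecoder:
    """Accumulates bytes and returns complete length-prefixed messages."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def feed(self, data: bytes) -> list[bytes]:
        self._buf += data
        messages = []
        while len(self._buf) >= HEADER.size:
            (length,) = HEADER.unpack_from(self._buf)
            if length > MAX_MESSAGE:
                raise ValueError(f"Message too large: {length} bytes")
            end = HEADER.size + length
            if len(self._buf) < end:
                break                 # payload incomplete; wait for more bytes
            messages.append(bytes(self._buf[HEADER.size:end]))
            del self._buf[:end]       # drop the consumed frame
        return messages


def encode(payload: bytes) -> bytes:
    return HEADER.pack(len(payload)) + payload


decoder = FrameDecoder()
stream = encode(b"hello") + encode(b"") + encode(b"world")

# Feed the stream in awkward 3-byte pieces to mimic arbitrary TCP segmentation.
out = []
for i in range(0, len(stream), 3):
    out.extend(decoder.feed(stream[i:i + 3]))

print(out)
```

In an event loop, each connection would own one decoder and call feed() with whatever recv() returned.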

Interview Q&A

Q1: What happens inside the kernel when you call accept() on a listening TCP socket?

TCP is a connection-oriented protocol. Before your application calls accept(), the kernel has already completed the full three-way handshake on your behalf. When the kernel receives a SYN from a client, it sends SYN-ACK and moves the connection to the SYN-RECEIVED state, storing it in the socket's SYN queue (also called the incomplete queue). When the client's ACK arrives, the connection moves to ESTABLISHED and is placed in the accept queue (completed connections queue). The listen(backlog) parameter controls the maximum size of the accept queue; on Linux it is capped by net.core.somaxconn. accept() simply dequeues one entry from the accept queue and creates a new socket file descriptor for communication with that specific client. If the accept queue is full (the application is slow to call accept()), Linux by default drops the incoming handshake packets so that the client retransmits, or sends RST if tcp_abort_on_overflow is enabled. This is why calling accept() in a tight loop is important - you don't want the accept queue to become the bottleneck.

Q2: Explain the difference between select(), poll(), and epoll() in terms of performance characteristics and use cases.

All three are I/O multiplexing mechanisms that allow monitoring multiple file descriptors for readiness without blocking. select() uses a fixed-size bitset (fd_set) with a system-defined maximum of 1024 FDs (FD_SETSIZE). Each call copies the entire bitset from user to kernel space and the kernel scans all bits - O(n) in the number of monitored FDs. poll() removes the 1024 FD limit by using an array of pollfd structs that grows dynamically, but still copies the entire array each call and scans all entries - still O(n). epoll() (Linux-specific) maintains its interest list inside the kernel as a red-black tree. epoll_ctl() adds/removes/modifies individual FDs in O(log n). epoll_wait() returns only the FDs that are actually ready, never scans idle FDs - O(k) where k is the number of ready events, not the total registered FDs. For servers with 10,000+ connections where most are idle at any moment, epoll is orders of magnitude more efficient. asyncio's SelectorEventLoop uses epoll on Linux, kqueue on macOS (via the selectors module's DefaultSelector which picks the best available mechanism).
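The selectors module mentioned above lets one piece of code use whichever mechanism the platform offers. The sketch below is a minimal callback-style echo loop built on DefaultSelector. The function names and the run_once() driver are inventions for this example, and the loop is stepped by hand against a single loopback client so that the whole thing runs and exits in one process.

```python
import selectors
import socket
import time

sel = selectors.DefaultSelector()   # picks epoll, kqueue, poll, or select
echoed = []                         # chunks echoed so far (demo bookkeeping only)


def on_accept(listener: socket.socket) -> None:
    conn, _addr = listener.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, data=on_readable)


def on_readable(conn: socket.socket) -> None:
    try:
        chunk = conn.recv(4096)
    except BlockingIOError:
        return                      # spurious wakeup; try again on the next event
    except ConnectionResetError:
        chunk = b""
    if chunk:
        conn.sendall(chunk)         # acceptable for tiny payloads in a sketch
        echoed.append(chunk)
    else:
        sel.unregister(conn)
        conn.close()


def run_once(timeout: float) -> None:
    """Process one batch of ready events."""
    for key, _mask in sel.select(timeout):
        key.data(key.fileobj)       # key.data holds the callback registered above


listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen()
listener.setblocking(False)
sel.register(listener, selectors.EVENT_READ, data=on_accept)

client = socket.create_connection(listener.getsockname(), timeout=5)
client.sendall(b"hi")

# Step the loop until the echo has happened (or give up after 5 seconds).
deadline = time.monotonic() + 5
while not echoed and time.monotonic() < deadline:
    run_once(timeout=0.1)
reply = client.recv(1024)

client.close()
run_once(timeout=0.5)               # lets the server side observe EOF and clean up
sel.unregister(listener)
listener.close()
sel.close()
print(reply)
```

A production loop would also register for EVENT_WRITE and keep a per-connection output buffer instead of calling sendall() on a non-blocking socket.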

Q3: What is SO_REUSEADDR and why is it essential for TCP servers?

When a TCP server process terminates, connections it closed may remain in the TIME_WAIT state for 2 × MSL (Maximum Segment Lifetime); Linux hard-codes this period at 60 seconds. TIME_WAIT exists to ensure any delayed packets from the closed connection don't corrupt a new connection on the same address and port pair. Without SO_REUSEADDR, trying to bind() to the same port within this window fails with EADDRINUSE. For a production server that restarts frequently (deployments, crashes), this 60-second window is unacceptable. Setting SO_REUSEADDR before bind() allows the socket to bind to a port even if sockets in TIME_WAIT still exist on that port, as long as no other socket is actively listening on the same address and port. It does not bypass TIME_WAIT for the old connections - those still expire normally - it only allows the new server socket to bind immediately. The correct pattern is always: sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1) immediately after socket() and before bind().

Q4: What is Nagle's algorithm and when should you disable it with TCP_NODELAY?

Nagle's algorithm (RFC 896) reduces network congestion by buffering small TCP writes: while previously sent data is still unacknowledged, the sender holds back new small writes until the ACK arrives or enough data accumulates for a full-size segment (MSS, typically 1460 bytes). The goal was to fix the "small-packet problem" on slow networks - applications sending single-byte writes would generate a 40-byte TCP+IP header for every byte. In 1984, this was a serious problem. Today, it causes noticeable latency for interactive protocols, especially when it interacts with the receiver's delayed-ACK timer (40 ms on Linux, up to 200 ms on other stacks). Redis, for example, disables Nagle because it sends small command packets and expects immediate replies - a stall of that size on a request is catastrophic. HTTP/1.1 servers with pipelining benefit from disabling Nagle on the server socket. On the other hand, bulk data transfer (file uploads, video streaming) loses nothing with Nagle enabled - full-size segments go out immediately and small trailing writes are coalesced. The rule: if you send small messages and need low latency, set TCP_NODELAY. If you send large sequential data streams, leave Nagle enabled.

Q5: How do Unix domain sockets work at the kernel level, and why are they faster than TCP loopback?

Unix domain sockets (AF_UNIX) are implemented entirely in the kernel's socket layer and never touch the network stack. When process A calls send() on a Unix domain socket connected to process B, the kernel copies the data from A's buffer into a kernel buffer queued directly on B's socket - no IP header, no TCP header, no checksum computation, no routing table lookup, no ethernet frame encapsulation. The filesystem path (e.g., /tmp/myapp.sock) is just a rendezvous mechanism, looked up through the VFS at bind() and connect() time - after accept(), the socket is purely an in-kernel object. On Linux, the abstract namespace (addresses beginning with \0) doesn't even create a filesystem entry. Benchmarks typically show Unix socket round-trip latency at 0.5–1 μs vs 3–4 μs for TCP loopback, with proportionally higher throughput, though exact numbers depend on hardware and kernel version. The practical implication: prefer Unix sockets for same-host IPC wherever both ends can reach the same socket path (database drivers, nginx-to-uwsgi, container-to-container on the same node via a shared volume).
